The plan is to model the excess mortality from the February winter storms in Texas. I saw the BuzzFeed analysis and article. I thought it might be beneficial to implement a more modern time series modeling approach and to make a comparison to large states rather than small neighboring states. Below is what came of it.
Spoiler: my conclusion is similar to that of BuzzFeed though I think the number is probably not as high as the 700 that BuzzFeed estimated. Ultimately, it is hard to tell exactly since there is noise due to random deviations from model expectations and those deviations in large states often amount to 250 deaths weekly in similarly sized states. Thus, the 750 deaths in excess of the model exceeded random expected deviation, but it may only be 500 deaths or so greater than expected. Indeed Florida and California saw deviations in excess of 400 deaths and California had several weeks with greater than 500 deaths above the model.
I obtained the data directly from CDC in the manner described by BuzzFeed so it would be completely independent of the BuzzFeed analysis.
First, I import the data. Then, I perform some overly complicated manipulations to combine New York and New York City so it can be incorporated in the analysis as a full state. I’m sure there is a more parsimonious way to accomplish this task, but I just wanted to finish the analysis. The data that I call training data is the the pre-2020 data and the test data is the data starting from January 1st, 2020. At the end of the below code chunk, I convert it to a time series object to facilitate further analyses.
training.df <- read_csv('data/mortality_data_pre2020.csv')
test.df <- read_csv('data/mortality_data_post2020.csv')
training.df <- training.df %>%
clean_names() %>%
select(-starts_with('flag')) %>%
mutate(week_ending_date = mdy(week_ending_date)) %>%
filter(jurisdiction_of_occurrence != 'United States') %>%
mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
tmp.df <- bind_cols((training.df %>% filter(jurisdiction_of_occurrence == 'NewYork') %>% select(colnames(training.df)[1:4])), (training.df %>% filter(jurisdiction_of_occurrence == 'NewYork') %>% select(colnames(training.df)[5:ncol(training.df)])) + (training.df %>% filter(jurisdiction_of_occurrence == 'NewYorkCity') %>% select(colnames(training.df)[5:ncol(training.df)])))
training.df <- training.df %>%
filter(jurisdiction_of_occurrence != 'NewYork' &
jurisdiction_of_occurrence != 'NewYorkCity') %>%
bind_rows(tmp.df) %>%
arrange(jurisdiction_of_occurrence, week_ending_date)
rm(tmp.df)
test.df <- test.df %>%
clean_names() %>%
select(-starts_with('flag')) %>%
mutate(week_ending_date = ymd(week_ending_date)) %>%
filter(jurisdiction_of_occurrence != 'United States') %>%
mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
tmp.df <- bind_cols((test.df %>% filter(jurisdiction_of_occurrence == 'NewYork') %>% select(colnames(test.df)[1:4])), (test.df %>% filter(jurisdiction_of_occurrence == 'NewYork') %>% select(colnames(test.df)[5:ncol(test.df)])) + (test.df %>% filter(jurisdiction_of_occurrence == 'NewYorkCity') %>% select(colnames(test.df)[5:ncol(test.df)])))
test.df <- test.df %>%
filter(jurisdiction_of_occurrence != 'NewYork' &
jurisdiction_of_occurrence != 'NewYorkCity') %>%
bind_rows(tmp.df) %>%
arrange(jurisdiction_of_occurrence, week_ending_date)
rm(tmp.df)
train.ts <- training.df %>%
select(week_ending_date, jurisdiction_of_occurrence, all_cause) %>%
pivot_wider(names_from = jurisdiction_of_occurrence, values_from = all_cause)
train.ts <- ts(train.ts %>% select(-week_ending_date),
start = min(train.ts$week_ending_date),
frequency = 52)
Let’s see what the basic all cause mortality data looks like for the state of Texas. Here are the total deaths for each week of the training data.
p.basic <- ggplot(training.df %>% filter(jurisdiction_of_occurrence == 'Texas'),
aes(x = week_ending_date,
y = all_cause)) +
geom_point(color = '#000000',
fill = '#AAAAAA',
shape = 21,
size = 3) +
xlab('week ending date') +
ylab('total number of deaths') +
theme_bw(16)
show(p.basic)
I start with a relatively straightforward seasonal ARIMA model (aka SARIMA). Given the computational complexity (lots of time) of fitting lots of different SARIMA models, I simplified the selection scheme to just identify the best ARIMA model and added a first order seasonal diff with a first order seasonal moving average. For the ARIMA selection scheme, I used AIC with a cutoff of 2 to select the best model. For brevity, that is not shown, but the code is there.
#fits <- find.fit(train.ts[,'Texas'])
#print.top.n(10, fits)
final.sar.fit <- Arima(train.ts[,'Texas'],
order = c(4,1,4),
season = list(order = c(0,1,1), period = 52))
pred.sarima <- forecast(final.sar.fit,
h = 72)
On visual inspection, the fit appears to make sense for Texas. The plot shows the real data and the model prediction based on modeling from old data. The real data includes COVID-related deaths so those will need to be removed.
td.sar <- c(test.df %>% filter(jurisdiction_of_occurrence == 'Texas') %>% select(all_cause))$all_cause
pd.sar <- clean_names(as_tibble(pred.sarima))$point_forecast
whole_series <- c(train.ts[,'Texas'], td.sar, pd.sar)
sar.df <- tibble(date = c(seq.Date(from = min(training.df$week_ending_date),
by = 7,
length.out = length(train.ts[,'Texas'])),
seq.Date(from = min(test.df$week_ending_date),
by = 7,
length.out = length(td.sar)),
seq.Date(from = max(training.df$week_ending_date) + 7,
by = 7,
length.out = length(pd.sar))),
point.val = whole_series,
label = c(rep('real', length(train.ts[,5]) + length(td.sar)),
rep('model', length(pd.sar))))
p.pred.sar <- ggplot(sar.df, aes(x = date,
y = point.val,
color = label,
fill = label)) +
geom_point(shape = 21,
size = 3,
alpha = 0.5) +
ylab('number of weekly deaths') +
xlab('week ending date') +
scale_color_manual(
name = '',
values = darken(cols, 0.3)
) +
scale_fill_manual(
name = '',
values = cols
) +
theme_bw(16) +
theme(
legend.position = "top",
legend.justification = "right",
legend.text = element_text(size = 9),
legend.box.spacing = unit(0, "pt")
)
show(p.pred.sar)
rm(td.sar, pd.sar, whole_series)
While I considered fitting a SARIMA model for each state individually, a better strategy seemed to be to just fit a vector auto-regressive model for all of the state-level all cause mortality data simultaneously. This, hopefully, gives be a good fit for all of the time series and also allows the information from one time series to help with modeling the others. I included trend, intercept and seasonal components. VARselect seems to vacilate between a lag of 3 or 4 depending on the max lags tested. I went ahead and conservatively used a lag of 3 though it is worth noting that a lag of 4 may have a substantial effect on the outcome.
fits <- VARselect(train.ts,
lag.max = 7,
season = 52,
type = 'both')
print(fits$criteria)
## 1 2 3 4 5 6 7
## AIC(n) 3.531041e+02 3.511588e+02 3.388136e+02 NaN -Inf -Inf -Inf
## HQ(n) 3.796760e+02 3.908900e+02 3.917042e+02 NaN -Inf -Inf -Inf
## SC(n) 4.195446e+02 4.505030e+02 4.710617e+02 NaN -Inf -Inf -Inf
## FPE(n) 1.014868e+154 8.651973e+154 9.945331e+153 -4.752445e+42 0 0 0
final.var.fit <- VAR(train.ts,
p = 3,
season = 52,
type = 'both')
pred.var <- forecast(final.var.fit, h = 72)
rm(fits)
Here, I plot the VAR plot for just Texas. The plot shows the real deaths and the expected deaths based on model predictions. The real data includes COVID-related deaths so those will need to be removed.
td.var <- c(test.df %>% filter(jurisdiction_of_occurrence == 'Texas') %>% select(all_cause))$all_cause
pd.var <- clean_names(as_tibble(pred.var$forecast$Texas))$point_forecast
whole_series <- c(train.ts[,'Texas'], td.var, pd.var)
var.df <- tibble(date = c(seq.Date(from = min(training.df$week_ending_date),
by = 7,
length.out = length(train.ts[,'Texas'])),
seq.Date(from = min(test.df$week_ending_date),
by = 7,
length.out = length(td.var)),
seq.Date(from = max(training.df$week_ending_date) + 7,
by = 7,
length.out = length(pd.var))),
point.val = whole_series,
label = c(rep('real', length(train.ts[,'Texas']) + length(td.var)),
rep('model', length(pd.var))))
p.pred.var <- ggplot(var.df, aes(x = date,
y = point.val,
color = label,
fill = label)) +
geom_point(shape = 21,
size = 3,
alpha = 0.5) +
ylab('number of weekly deaths') +
xlab('week ending date') +
scale_color_manual(
name = '',
values = darken(cols, 0.3)
) +
scale_fill_manual(
name = '',
values = cols
) +
theme_bw(16) +
theme(
legend.position = "top",
legend.justification = "right",
legend.text = element_text(size = 9),
legend.box.spacing = unit(0, "pt")
)
show(p.pred.var)
rm(td.var, pd.var, whole_series)
Here, I calculate the values that we care about including the deaths over expected and the deaths over expected after removing COVID-related deaths.
tmp <- training.df %>% filter(jurisdiction_of_occurrence == 'Texas')
resid.tmp.var <- residuals(final.var.fit)
resid.tmp.sar <- residuals(final.sar.fit)
tmp.mod <- tibble(week_ending_date = tmp$week_ending_date[4:nrow(tmp)],
jurisdiction_of_occurrence = tmp$jurisdiction_of_occurrence[4:nrow(tmp)],
deaths_anomaly_var = resid.tmp.var[,'Texas'],
deaths_anomaly_sar = resid.tmp.sar[4:nrow(tmp)],
non_covid_deaths_anomaly_var = resid.tmp.var[,'Texas'],
non_covid_deaths_anomaly_sar = resid.tmp.sar[4:nrow(tmp)],
label = rep('model residual before 2020', length(resid.tmp.var[,'Texas'])))
mortality_working <- test.df %>% filter(jurisdiction_of_occurrence == 'Texas')
tmp.var <- clean_names(as_tibble(pred.var$forecast$Texas))
tmp.sar <- clean_names(as_tibble(pred.sarima))
mortality_working <- mortality_working %>%
mutate(deaths_anomaly_var = all_cause - tmp.var$point_forecast,
deaths_anomaly_var_upr = all_cause - tmp.var$hi_95,
deaths_anomaly_var_lwr = all_cause - tmp.var$lo_95,
non_covid_deaths_anomaly_var = deaths_anomaly_var - covid_19_u071_underlying_cause_of_death,
non_covid_deaths_anomaly_var_upr = deaths_anomaly_var_upr - covid_19_u071_underlying_cause_of_death,
non_covid_deaths_anomaly_var_lwr = deaths_anomaly_var_lwr - covid_19_u071_underlying_cause_of_death,
deaths_anomaly_sar = all_cause - tmp.sar$point_forecast,
deaths_anomaly_sar_upr = all_cause - tmp.sar$hi_95,
deaths_anomaly_sar_lwr = all_cause - tmp.sar$lo_95,
non_covid_deaths_anomaly_sar = deaths_anomaly_sar - covid_19_u071_underlying_cause_of_death,
non_covid_deaths_anomaly_sar_upr = deaths_anomaly_sar_upr - covid_19_u071_underlying_cause_of_death,
non_covid_deaths_anomaly_sar_lwr = deaths_anomaly_sar_lwr - covid_19_u071_underlying_cause_of_death,
label = 'model residual from 2020')
mortality_working <- tmp.mod %>% full_join(mortality_working)
rm(tmp, resid.tmp.var, resid.tmp.sar, tmp.mod, tmp.var, tmp.sar)
These are the plots of the non-COVID deaths over expected for the state of Texas. The data appear to show that there are more deaths than expected after removing COVID for much of the time after January 1st 2020 in the state. It is not clear why that might be the case.
I would also like to point out that the residuals of the real data are dramatically better with the VAR model than with SARIMA. VAR is used exclusively for all further analyses.
p.anom.sar <- ggplot(mortality_working, aes(x = week_ending_date,
y = non_covid_deaths_anomaly_sar,
color = label,
fill = label)) +
ylim(-1000, 1000) +
geom_point(shape = 21,
size = 3,
alpha = 0.5) +
ylab('number of weekly deaths over COVID') +
xlab('week ending date') +
scale_color_manual(
name = '',
values = darken(cols, 0.3)
) +
scale_fill_manual(
name = '',
values = cols
) +
theme_bw(12) +
theme(
legend.position = "top",
legend.justification = "right",
legend.text = element_text(size = 9),
legend.box.spacing = unit(0, "pt")
)
p.anom.var <- ggplot(mortality_working, aes(x = week_ending_date,
y = non_covid_deaths_anomaly_var,
color = label,
fill = label)) +
ylim(-1000, 1000) +
geom_point(shape = 21,
size = 3,
alpha = 0.5) +
ylab('number of weekly deaths over COVID') +
xlab('week ending date') +
scale_color_manual(
name = '',
values = darken(cols, 0.3)
) +
scale_fill_manual(
name = '',
values = cols
) +
theme_bw(12) +
theme(
legend.position = "top",
legend.justification = "right",
legend.text = element_text(size = 9),
legend.box.spacing = unit(0, "pt")
)
p.anom <- plot_grid(p.anom.sar, p.anom.var, ncol = 2, labels = NULL)
show(p.anom)
To make sure there isn’t some national underestimation as seen in the Texas data, I go back to the full VAR fit and use it to model the expected deaths for every state in the model. Then, I remove the deaths from COVID to leave excess deaths over expected.
tmp <- training.df %>% filter(mmwr_week > 3 | mmwr_year != 2014)
resid.tmp.var <- as_tibble(residuals(final.var.fit)) %>%
pivot_longer(names_to = 'jurisdiction_of_occurrence', values_to = 'deaths_anomaly_var', everything()) %>%
arrange(jurisdiction_of_occurrence) %>%
mutate(non_covid_deaths_anomaly_var = deaths_anomaly_var)
tmp <- tmp %>% bind_cols(resid.tmp.var %>% select(deaths_anomaly_var, non_covid_deaths_anomaly_var)) %>%
mutate(point_forecast = all_cause + deaths_anomaly_var)
tmp.var <- clean_names(tk_tbl(pred.var))
colnames(tmp.var) <- c('week_ending_date', 'jurisdiction_of_occurrence', 'point_forecast', 'lo_80', 'hi_80', 'lo_95', 'hi_95')
tmp.var$week_ending_date <- test.df$week_ending_date
tmp.test <- test.df %>% left_join(tmp.var)
tmp <- tmp %>% full_join(tmp.test)
mortality_working <- tmp %>%
mutate(deaths_anomaly_var = all_cause - point_forecast,
deaths_anomaly_var_upr = all_cause - hi_95,
deaths_anomaly_var_lwr = all_cause - lo_95,
non_covid_deaths_anomaly_var = deaths_anomaly_var - covid_19_u071_underlying_cause_of_death,
non_covid_deaths_anomaly_var_upr = deaths_anomaly_var_upr - covid_19_u071_underlying_cause_of_death,
non_covid_deaths_anomaly_var_lwr = deaths_anomaly_var_lwr - covid_19_u071_underlying_cause_of_death) %>%
mutate(covid_19_u071_underlying_cause_of_death = replace_na(covid_19_u071_underlying_cause_of_death, 0))
rm(tmp, resid.tmp.var, tmp.var, tmp.test)
First, I plot the real number of weekly deaths in black and overlay the expected number of weekly deaths in dark red. That is, I keep COVID deaths in the plot initially. As is obvious, there are several states that have significant deviations from the expectation. In the next plot, I remove COVID deaths for all states.
p.var.all <- ggplot(mortality_working) +
geom_point(aes(x = week_ending_date,
y = all_cause),
color = 'black',
alpha = 0.3) +
geom_point(aes(x = week_ending_date,
y = point_forecast),
color = 'darkred',
alpha = 0.3) +
xlab('week ending date') +
ylab('number of weekly deaths') +
theme_bw(12) +
facet_wrap(~jurisdiction_of_occurrence, ncol = 4)
show(p.var.all)
Below, I again show the real data in black, but now with the COVID deaths removed. In dark red I still show the expectation from the VAR model. Therefore, this shows deaths in excess of expectation after removing COVID deaths. This clearly demonstrates that the expectation fits are quite good if COVID deaths are removed. However there are still a few deviations. There may be COVID deaths that aren’t correctly counted or simply aren’t yet counted in the data; this may be happening in California. Larger states might be expected to have larger absolute numbers of misses. Or, in the case of North Carolina, there appears to be something entirely broken about the data.
p.var.noncovid <- ggplot(mortality_working) +
geom_point(aes(x = week_ending_date,
y = all_cause - covid_19_u071_underlying_cause_of_death),
color = 'black',
alpha = 0.3) +
geom_point(aes(x = week_ending_date,
y = point_forecast),
color = 'darkred',
alpha = 0.3) +
xlab('week ending date') +
ylab('number of weekly deaths over COVID') +
theme_bw(12) +
facet_wrap(~jurisdiction_of_occurrence, ncol = 4)
show(p.var.noncovid)
I show the absolute number of non-COVID deaths in excess of expectation. Essentially, this is subtracting the dark red points from the black points above. I only show the data for the start of 2021 to get a sense for the nationwide-wide data. Several states show intermittent deviations from the expectation. North Carolina is, again, broken. California has significant residual deviation as does Florida, Texas, South Carolina and West Virginia at times.
p.var.noncovid.feb <- ggplot(mortality_working %>%
filter(week_ending_date >= '2021-01-01' &
week_ending_date < '2021-04-15')) +
geom_lollipop(aes(x = week_ending_date,
y = all_cause - covid_19_u071_underlying_cause_of_death - point_forecast),
color = '#000000',
fill = '#AAAAAA',
alpha = 0.8,
point.size = 2.5,
shape = 21,
size = 1.1) +
xlab('week ending date') +
ylab('excess weekly deaths') +
ylim(-2000, 2000) +
theme_bw(12) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
facet_wrap(~jurisdiction_of_occurrence, ncol = 4)
show(p.var.noncovid.feb)
Finally, I narrow down to the largest four states to get a sense for the deviations that are possible in large states. Unfortunately, selecting neighboring states is probably not the right approach. Given that the values are absolute deaths in excess of expectation, the analysis could end up being biased by the size of the state. Larger states would be expected to have larger numbers of deaths incorrectly categorized.
I think I agree with BuzzFeed. Based on these data, there probably were more than the 100-200 deaths reported by state agencies in the final two weeks of February 2021 during the Winter Storm. However, I think the number of deaths is probably closer to 500 rather than 700, but it is hard to know.
p.var.noncovid.four.feb <- ggplot(mortality_working %>%
filter(jurisdiction_of_occurrence %in% c('California',
'Texas',
'NewYork',
'Florida') &
week_ending_date >= '2021-01-01' &
week_ending_date < '2021-04-15')) +
geom_lollipop(aes(x = week_ending_date,
y = all_cause - covid_19_u071_underlying_cause_of_death - point_forecast),
color = '#000000',
fill = '#AAAAAA',
alpha = 0.8,
point.size = 2.5,
shape = 21,
size = 1.1) +
xlab('week ending date') +
ylab('excess weekly deaths') +
ylim(-800, 800) +
theme_bw(12) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
facet_wrap(~jurisdiction_of_occurrence, ncol = 5)
show(p.var.noncovid.four.feb)